YouTube (the world-famous video sharing website) maintains a list of the top trending videos on the platform. This dataset is a daily record of the top trending YouTube videos.
This dataset includes several months (and counting) of data on daily trending YouTube videos. Data is included for the USA, with up to 200 listed trending videos per day.
More details about this dataset are available in About_dataset.txt, included with this project.
Before we move on to plotting and analyzing the data, let us see if the data requires any cleaning.
The first 2 rows of the dataset, shown column by column, are as follows:
## video_id trending_date
## 1 2kyS6SvSYSE 17.14.11
## 2 1ZAPwfrtAFY 17.14.11
## title
## 1 WE WANT TO TALK ABOUT OUR MARRIAGE
## 2 The Trump Presidency: Last Week Tonight with John Oliver (HBO)
## channel_title category_id publish_time
## 1 CaseyNeistat 22 2017-11-13T17:13:01.000Z
## 2 LastWeekTonight 24 2017-11-13T07:30:00.000Z
## tags
## 1 SHANtell martin
## 2 last week tonight trump presidency|last week tonight donald trump|john oliver trump|donald trump
## views likes dislikes comment_count
## 1 748374 57527 2966 15954
## 2 2418783 97185 6146 12703
## thumbnail_link comments_disabled
## 1 https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg False
## 2 https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg False
## ratings_disabled video_error_or_removed
## 1 False False
## 2 False False
## description
## 1 SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\\nCANDICE - https://www.lovebilly.com\\n\\nfilmed this video in 4k on this -- http://amzn.to/2sTDnRZ\\nwith this lens -- http://amzn.to/2rUJOmD\\nbig drone - http://tinyurl.com/h4ft3oy\\nOTHER GEAR --- http://amzn.to/2o3GLX5\\nSony CAMERA http://amzn.to/2nOBmnv\\nOLD CAMERA; http://amzn.to/2o2cQBT\\nMAIN LENS; http://amzn.to/2od5gBJ\\nBIG SONY CAMERA; http://amzn.to/2nrdJRO\\nBIG Canon CAMERA; http://tinyurl.com/jn4q4vz\\nBENDY TRIPOD THING; http://tinyurl.com/gw3ylz2\\nYOU NEED THIS FOR THE BENDY TRIPOD; http://tinyurl.com/j8mzzua\\nWIDE LENS; http://tinyurl.com/jkfcm8t\\nMORE EXPENSIVE WIDE LENS; http://tinyurl.com/zrdgtou\\nSMALL CAMERA; http://tinyurl.com/hrrzhor\\nMICROPHONE; http://tinyurl.com/zefm4jy\\nOTHER MICROPHONE; http://tinyurl.com/jxgpj86\\nOLD DRONE (cheaper but still great);http://tinyurl.com/zcfmnmd\\n\\nfollow me; on http://instagram.com/caseyneistat\\non https://www.facebook.com/cneistat\\non https://twitter.com/CaseyNeistat\\n\\namazing intro song by https://soundcloud.com/discoteeth\\n\\nad disclosure. THIS IS NOT AN AD. not selling or promoting anything. but samsung did produce the Shantell Video as a 'GALAXY PROJECT' which is an initiative that enables creators like Shantell and me to make projects we might otherwise not have the opportunity to make. hope that's clear. if not ask in the comments and i'll answer any specifics.
## 2 One year after the presidential election, John Oliver discusses what we've learned so far and enlists our catheter cowboy to teach Donald Trump what he hasn't.\\n\\nConnect with Last Week Tonight online...\\n\\nSubscribe to the Last Week Tonight YouTube channel for more almost news as it almost happens: www.youtube.com/user/LastWeekTonight\\n\\nFind Last Week Tonight on Facebook like your mom would: http://Facebook.com/LastWeekTonight\\n\\nFollow us on Twitter for news about jokes and jokes about news: http://Twitter.com/LastWeekTonight\\n\\nVisit our official site for all that other stuff at once: http://www.hbo.com/lastweektonight
In many of the following data cleaning steps only the code but not the output is printed to prevent repeated printing of the same dataset with minor modifications. The final dataset obtained after the data cleaning is printed at the end of this section.
As we are only interested in exploratory analysis of the data, we remove the columns of tags, thumbnail_link and description since they are irrelevant to us.
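The code for this step is not printed, so the following is only a minimal sketch of one way to drop the three columns in base R. The toy data frame `yt_raw` and the result name `yt_small` are hypothetical; only the column names come from the dataset, and the description values are placeholders.

```r
# Toy stand-in for the raw data (hypothetical object; only the column names
# mirror the dataset, and the description text is a placeholder).
yt_raw <- data.frame(
  video_id       = c("2kyS6SvSYSE", "1ZAPwfrtAFY"),
  views          = c(748374, 2418783),
  tags           = c("SHANtell martin", "last week tonight trump presidency"),
  thumbnail_link = c("https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg",
                     "https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg"),
  description    = c("placeholder text", "placeholder text"),
  stringsAsFactors = FALSE
)

# Columns that are not needed for the exploratory analysis
drop_cols <- c("tags", "thumbnail_link", "description")

# Keep every column whose name is not in drop_cols
yt_small <- yt_raw[, !(names(yt_raw) %in% drop_cols), drop = FALSE]
names(yt_small)
```

Selecting by name rather than by position keeps the step valid even if the column order of the input file changes.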
## video_id trending_date
## 1 2kyS6SvSYSE 17.14.11
## 2 1ZAPwfrtAFY 17.14.11
## 3 5qpjK5DgCt4 17.14.11
## 4 puqaWrEC7tY 17.14.11
## 5 d380meD0W0M 17.14.11
## 6 gHZ1Qz0KiKM 17.14.11
## title
## 1 WE WANT TO TALK ABOUT OUR MARRIAGE
## 2 The Trump Presidency: Last Week Tonight with John Oliver (HBO)
## 3 Racist Superman | Rudy Mancuso, King Bach & Lele Pons
## 4 Nickelback Lyrics: Real or Fake?
## 5 I Dare You: GOING BALD!?
## 6 2 Weeks with iPhone X
## channel_title category_id publish_time views
## 1 CaseyNeistat 22 2017-11-13T17:13:01.000Z 748374
## 2 LastWeekTonight 24 2017-11-13T07:30:00.000Z 2418783
## 3 Rudy Mancuso 23 2017-11-12T19:05:24.000Z 3191434
## 4 Good Mythical Morning 24 2017-11-13T11:00:04.000Z 343168
## 5 nigahiga 24 2017-11-12T18:01:41.000Z 2095731
## 6 iJustine 28 2017-11-13T19:07:23.000Z 119180
## likes dislikes comment_count comments_disabled ratings_disabled
## 1 57527 2966 15954 False False
## 2 97185 6146 12703 False False
## 3 146033 5339 8181 False False
## 4 10172 666 2146 False False
## 5 132235 1989 17518 False False
## 6 9763 511 1434 False False
## video_error_or_removed
## 1 False
## 2 False
## 3 False
## 4 False
## 5 False
## 6 False
## [1] 23362 13
From the dimensions of the data frame in the above output, the dataset has 23,362 rows and 13 features. Each row corresponds to a distinct video only if there are no duplicates; we investigate this in the following sections. We store the data in a data.table, as it is generally more convenient and faster than a data frame for add, remove, update and join operations.
library(data.table)
yt_trending <- data.table(yt_trending)
We also have the name of each category_id (in the US_category_id.json file), so we add a category_name column.
category_id <- c(1, 2, 10, 15, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
                 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44)
category_name <- c("Film & Animation", "Autos & Vehicles", "Music",
                   "Pets & Animals", "Sports", "Short Movies",
                   "Travel & Events", "Gaming", "Videoblogging",
                   "People & Blogs", "Comedy", "Entertainment",
                   "News & Politics", "Howto & Style", "Education",
                   "Science & Technology",
                   "Nonprofits & Activism", "Movies", "Anime/Animation",
                   "Action/Adventure", "Classics", "Comedy",
                   "Documentary", "Drama", "Family", "Foreign", "Horror",
                   "Sci-Fi/Fantasy", "Thriller", "Shorts", "Shows", "Trailers")
category_id_names <- data.frame(category_id, category_name)
# Join on the shared category_id column
yt_trending <- merge(yt_trending, category_id_names, by = "category_id")
# Sort by views, most viewed first
yt_trending <- yt_trending[order(yt_trending$views, decreasing = TRUE), ]
The number of videos each channel has in the trending videos is added to the dataset.
# Count the rows (trending appearances) per channel using data.table's .N
number_of_videos_trending <-
  yt_trending[, .(number_of_videos_trending = .N), by = channel_title]
# Attach the per-channel count to every row
yt_trending <- merge(yt_trending, number_of_videos_trending,
                     by = "channel_title")
We are most interested in analyzing how videos trend, but the dataset has no column for the number of days a video was on the trending list, so we add that column here.
# Each row records a video on one trending day, so the row count per
# video_id is the number of days the video was on trending
days_on_trending <- yt_trending[, .(days_on_trending = .N), by = video_id]
yt_trending <- merge(yt_trending, days_on_trending, by = "video_id")
# Sort by days on trending, then by views, both in decreasing order
yt_trending <- yt_trending[order(yt_trending$days_on_trending,
                                 yt_trending$views,
                                 decreasing = TRUE), ]
For the analysis of the top trending channels in the Bivariate Plots section, we also keep a copy of the data sorted by the number of trending videos per channel and then by views.
yt_trending_by_number_of_videos_trending <-
yt_trending[order(yt_trending$number_of_videos_trending,
yt_trending$views,
decreasing = TRUE),]
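The step that reduces each video to a single row (the entry for the day on which it had the highest views) is not printed in this report. A minimal base R sketch of one way to do it is given here; the toy data frame `yt_toy` and the names `by_views` and `yt_unique` are hypothetical, and ties are assumed to be broken by keeping the first row encountered.

```r
# Toy data: three videos, some appearing on several trending days
# (hypothetical values, for illustration only).
yt_toy <- data.frame(
  video_id      = c("a", "a", "b", "b", "b", "c"),
  trending_date = c("17.14.11", "17.15.11", "17.14.11",
                    "17.15.11", "17.16.11", "17.14.11"),
  views         = c(100, 250, 900, 700, 950, 40),
  stringsAsFactors = FALSE
)

# Sort so that, within each video_id, the row with the most views comes first
by_views <- yt_toy[order(yt_toy$video_id, -yt_toy$views), ]

# duplicated() flags every repeat after the first occurrence, so negating it
# keeps exactly one row per video_id: the one with the highest views
yt_unique <- by_views[!duplicated(by_views$video_id), ]
rownames(yt_unique) <- NULL
yt_unique
```

Counting days on trending has to happen before this reduction, since the count relies on one row per trending day.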
## [1] 4712 16
## [1] "video_id" "channel_title"
## [3] "category_id" "trending_date"
## [5] "title" "publish_time"
## [7] "views" "likes"
## [9] "dislikes" "comment_count"
## [11] "comments_disabled" "ratings_disabled"
## [13] "video_error_or_removed" "category_name"
## [15] "number_of_videos_trending" "days_on_trending"
## video_id channel_title
## 00nmxR1mxIA: 1 The Tonight Show Starring Jimmy Fallon: 51
## 00RpZZThSAs: 1 ESPN : 46
## 01AEuxSlIMg: 1 TheEllenShow : 44
## 02e9klKUN0Y: 1 Jimmy Kimmel Live : 42
## 02N508BDngc: 1 Netflix : 42
## 032BPsxhreM: 1 The Late Show with Stephen Colbert : 41
## (Other) :4706 (Other) :4446
## category_id trending_date
## Min. : 1.00 18.12.03: 199
## 1st Qu.:17.00 18.09.01: 141
## Median :24.00 18.01.02: 84
## Mean :20.44 17.13.12: 70
## 3rd Qu.:25.00 17.14.11: 69
## Max. :43.00 17.22.11: 68
## (Other) :4081
## title
## DORITOS BLAZE vs. MTN DEW ICE | Super Bowl Commercial with Peter Dinklage and Morgan Freeman: 2
## Justice League - Movie Review : 2
## Maroon 5 - Wait : 2
## Missouri Star Quilt Company Live Stream : 2
## NBA Bloopers - The Starters : 2
## Selena Gomez, Marshmello - Wolves : 2
## (Other) :4700
## publish_time views likes
## 2017-11-17T05:00:00.000Z: 4 Min. : 559 Min. : 0
## 2017-11-17T05:00:01.000Z: 3 1st Qu.: 95075 1st Qu.: 1600
## 2017-12-13T15:00:01.000Z: 3 Median : 331606 Median : 7726
## 2018-01-12T05:00:01.000Z: 3 Mean : 1277663 Mean : 39715
## 2018-02-16T14:00:03.000Z: 3 3rd Qu.: 1025326 3rd Qu.: 25876
## 2017-11-10T05:00:01.000Z: 2 Max. :149376127 Max. :3093544
## (Other) :4694
## dislikes comment_count comments_disabled ratings_disabled
## Min. : 0 Min. : 0.0 False:4633 False:4686
## 1st Qu.: 79 1st Qu.: 238.0 True : 79 True : 26
## Median : 302 Median : 888.5
## Mean : 2598 Mean : 4975.6
## 3rd Qu.: 1058 3rd Qu.: 2914.2
## Max. :1674420 Max. :1361580.0
##
## video_error_or_removed category_name number_of_videos_trending
## False:4711 Entertainment :1141 Min. : 1.00
## True : 1 Music : 585 1st Qu.: 8.00
## News & Politics: 438 Median : 24.00
## Howto & Style : 436 Mean : 37.58
## Comedy : 390 3rd Qu.: 62.00
## People & Blogs : 368 Max. :114.00
## (Other) :1354
## days_on_trending
## Min. : 1.000
## 1st Qu.: 3.000
## Median : 5.000
## Mean : 4.958
## 3rd Qu.: 7.000
## Max. :14.000
##
As we can see from the above output, we have data for 4,712 videos and 16 features.
## Action/Adventure Anime/Animation Autos & Vehicles
## 0 0 66
## Classics Comedy Documentary
## 0 390 0
## Drama Education Entertainment
## 0 186 1141
## Family Film & Animation Foreign
## 0 237 0
## Gaming Horror Howto & Style
## 57 0 436
## Movies Music News & Politics
## 0 585 438
## Nonprofits & Activism People & Blogs Pets & Animals
## 13 368 116
## Science & Technology Sci-Fi/Fantasy Short Movies
## 308 0 0
## Shorts Shows Sports
## 0 2 320
## Thriller Trailers Travel & Events
## 0 0 49
## Videoblogging
## 0
Plotting the number of videos by category name shows a large variation across categories. Entertainment and Music are the top 2 categories with 1,141 and 585 videos respectively, followed by News & Politics with 438 and Howto & Style with 436 videos. We can also see their respective shares of the total.
Also, some categories such as Action/Adventure and Anime/Animation have no videos.
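The shares of the total can be recomputed directly from the counts printed above. This is a small sketch in base R; the vector name `category_counts` is an illustrative choice, and categories with zero videos are left out because they do not affect the shares.

```r
# Non-zero category counts copied from the table printed above
category_counts <- c(
  "Autos & Vehicles" = 66, "Comedy" = 390, "Education" = 186,
  "Entertainment" = 1141, "Film & Animation" = 237, "Gaming" = 57,
  "Howto & Style" = 436, "Music" = 585, "News & Politics" = 438,
  "Nonprofits & Activism" = 13, "People & Blogs" = 368,
  "Pets & Animals" = 116, "Science & Technology" = 308, "Shows" = 2,
  "Sports" = 320, "Travel & Events" = 49
)

# Share of each category as a percentage of all videos
shares <- round(100 * category_counts / sum(category_counts), 1)

# Largest categories first
sorted_shares <- sort(shares, decreasing = TRUE)
head(sorted_shares, 4)
```

The counts add up to the 4,712 videos in the cleaned dataset, which is a quick consistency check on the table.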
## [1] "Summary of views feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 559 95070 331600 1278000 1025000 149400000
## [1] "Summary of likes feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1600 7726 39720 25880 3094000
## [1] "Summary of dislikes feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 79 302 2598 1058 1674000
## [1] "Summary of comments feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 238.0 888.5 4976.0 2914.0 1362000.0
This plot shows the log10 of the number of views, likes, dislikes and comments on the videos.
On the log10 scale, each distribution looks roughly normal. Further, we can see that the mean number of views is higher than that of any other attribute.
The mean, median and maximum number of views are 1,278,000, 331,600 and 149,400,000 respectively.
The mean, median and maximum number of likes are 39,720, 7,726 and 3,094,000 respectively.
The mean, median and maximum number of dislikes are 2,598, 302 and 1,674,000 respectively.
The mean, median and maximum number of comments are 4,976, 888.5 and 1,362,000 respectively.
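One practical detail of the log10 transform is that likes, dislikes and comments all have a minimum of 0, and log10(0) is -Inf. The report does not say how zeros were handled, so the sketch below simply assumes a shift of 1 before taking the logarithm; the vector `likes` holds the five-number summary of the likes feature and is used only as illustrative input.

```r
# Five-number summary of the likes feature, used as illustrative input
likes <- c(0, 1600, 7726, 25876, 3093544)

# A zero count maps to -Inf on the log scale and would fall off the plot
log10(0)

# Assumed workaround (not confirmed by the report): shift by 1, so that a
# count of 0 maps to 0
log_likes <- log10(likes + 1)
round(log_likes, 2)
```

For counts in the thousands or more, the shift changes the transformed value only negligibly.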
## [1] "Summary of videos with comments disabled"
## False True
## 4633 79
## [1] "Summary of videos with ratings disabled"
## False True
## 4686 26
So we have only 79 videos which have their comments disabled and only 26 videos with their ratings disabled.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 5.000 4.958 7.000 14.000
We can observe a roughly bimodal distribution for the days on trending, with videos trending for an average of about 5 days and a maximum of 14 days.
Videos of some categories, such as Autos & Vehicles and Comedy, are not even present in the top 100 trending videos.
## [1] "Summary of views feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 192400 792100 1665000 3411000 2939000 45940000
## [1] "Summary of likes feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 512 19830 49440 105600 131500 822000
## [1] "Summary of dislikes feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 146.0 779.2 1600.0 5052.0 3766.0 165100.0
## [1] "Summary of comments feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1815 4667 11980 10680 203900
For the top 100 trending videos, we cannot see a definite distribution shape as we could for these same features over the entire dataset.
The mean, median and maximum number of views are 3,411,000, 1,665,000 and 45,940,000 respectively.
The mean, median and maximum number of likes are 105,600, 49,440 and 822,000 respectively.
The mean, median and maximum number of dislikes are 5,052, 1,600 and 165,100 respectively.
The mean, median and maximum number of comments are 11,980, 4,667 and 203,900 respectively.
Compared with the entire dataset, the mean and median of every feature are higher, while the maxima are lower.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 12.00 13.00 12.59 13.00 14.00
Videos in the Top 100 Trending Videos trend for a median of 13 days (mean 12.59), with a maximum of 14 days.
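The report does not print how the Top 100 subset was built. Given that the data was earlier sorted by days on trending and then by views, one plausible reading is that the first 100 rows of that ordering were taken. The sketch below works under that assumption, on synthetic data; `yt_toy`, `ranked`, `top_n` and `top100` are hypothetical names.

```r
# Synthetic stand-in for the cleaned data (values are random, not real)
set.seed(1)
yt_toy <- data.frame(
  video_id         = sprintf("v%03d", 1:500),
  days_on_trending = sample(1:14, 500, replace = TRUE),
  views            = sample(1000:1000000, 500)
)

top_n <- 100

# Rank by days on trending first, then by views, both in decreasing order
ranked <- yt_toy[order(-yt_toy$days_on_trending, -yt_toy$views), ]

# Assumed definition of the Top 100: the first 100 rows of this ranking
top100 <- head(ranked, top_n)
summary(top100$days_on_trending)
```

Under this definition the subset is dominated by the longest-trending videos, which would explain why its days on trending fall in a narrow band near the maximum.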
The dataset after preprocessing to remove duplicates contains data for 4,712 videos and 16 features (video_id, channel_title, category_id, trending_date, title, publish_time, views, likes, dislikes, comment_count, comments_disabled, ratings_disabled, video_error_or_removed, category_name, number_of_videos_trending, days_on_trending).
Unordered factors: category_id, category_name.
category_id: 1,2,10,15,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33, 34,35,36,37,38,39,40,41,42,43,44
category_name: “Film & Animation”,“Autos & Vehicles”,“Music”, “Pets & Animals”,“Sports”,“Short Movies”, “Travel & Events”,“Gaming”,“Videoblogging”, “People & Blogs”,“Comedy”,“Entertainment”, “News & Politics”,“Howto & Style”,“Education”, “Science & Technology”, “Nonprofits & Activism”,“Movies”,“Anime/Animation”, “Action/Adventure”,“Classics”,“Comedy”, “Documentary”,“Drama”,“Family”,“Foreign”,“Horror”, “Sci-Fi/Fantasy”,“Thriller”,“Shorts”,“Shows”,“Trailers”
Other observations:
The number of views (views), number of days the video is trending (days_on_trending) and category of the video (category_name) are the main features of interest here.
Although subject to the viewers’ biases, number of likes, dislikes and comments can also help in understanding the video’s position on the list of trending videos.
The variable category_name was created to make the category_id feature easier to interpret; the two have a direct correspondence.
Also, the days_on_trending variable was created to keep track of the number of days the video was on trending.
One unusual observation was that there were no videos from some categories, such as Action/Adventure and Anime/Animation.
Some features (tags, thumbnail_link and description) were removed from the dataset as they took up a lot of space when printed and were irrelevant to our analysis.
A log10 transform was applied in multiple plots to convey the scale and variation in the data as required.
Also, a video that was trending on multiple days was reduced to a single entry, for the day on which it had the highest views.
The above correlation matrix helps in identifying some of the interesting trends in the data.
We have a high correlation between views and likes (0.83) and between dislikes and comment count (0.83).
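As a rough illustration of how such a correlation matrix can be computed, the sketch below applies base R's `cor()` to the four count features. It uses only the six rows printed near the top of this report, so its values will not match the correlations of the full dataset; `yt_head` and `cor_mat` are hypothetical names.

```r
# The six rows printed earlier in this report (not the full dataset)
yt_head <- data.frame(
  views         = c(748374, 2418783, 3191434, 343168, 2095731, 119180),
  likes         = c(57527, 97185, 146033, 10172, 132235, 9763),
  dislikes      = c(2966, 6146, 5339, 666, 1989, 511),
  comment_count = c(15954, 12703, 8181, 2146, 17518, 1434)
)

# Pearson correlation between every pair of features
cor_mat <- cor(yt_head, method = "pearson")
round(cor_mat, 2)
```

Because counts like these are heavily right-skewed, a rank-based alternative such as `method = "spearman"` could be worth comparing, since it is less sensitive to a few extreme videos.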
But before we draw scatter plots to visualize these correlations, we normalize the ranges of the four features (views, likes, dislikes and comment_count).
After normalizing each feature to the range [0,1], we get the following output:
## views likes dislikes comment_count
## 1: 0.117422509 0.122986452 1.236070e-02 0.0213883870
## 2: 0.015345060 0.072727590 5.189260e-03 0.0304550596
## 3: 0.007477869 0.002425697 3.487775e-04 0.0002379588
## 4: 0.006732219 0.015137654 4.240274e-04 0.0070895577
## 5: 0.005249547 0.003901351 7.608605e-04 0.0010671426
## 6: 0.004773304 0.004023864 8.719437e-05 0.0010825658
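The normalization code is not printed. A common way to map a feature to [0,1] is min-max scaling, x' = (x - min(x)) / (max(x) - min(x)), and the sketch below assumes that this is what was used. The helper `min_max` is a hypothetical name, and the input is the views column of the six rows printed earlier rather than the full dataset.

```r
# Min-max scaling: maps the smallest value to 0 and the largest to 1.
# Assumed method; the report does not show its own normalization code.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # A constant feature has no spread; return zeros instead of dividing by 0
    return(rep(0, length(x)))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

views <- c(748374, 2418783, 3191434, 343168, 2095731, 119180)
views_scaled <- min_max(views)
round(views_scaled, 3)
```

Min-max scaling is a linear transformation, so it changes the axis ranges of the scatter plots but leaves the Pearson correlations unchanged.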
Using suitable limits for the X and Y axis:
We can clearly observe the correlation that we found out previously using the correlation matrix.
We can see the variation in the features of likes, dislikes, comment count and days on trending in the following plots.
The notable but weaker correlations, such as those between views and dislikes and between views and comment count, are visible here.
As the number of views increases, the other features also increase, which agrees with the statistics calculated in the univariate plots section.
Moreover, only a small number of videos trend for 10 days or more.
Before we compare how the different features vary across the different categories, we take a look at the top 25 trending channels.
ESPN, Vox and Netflix are the top 3 channels (in that order) in terms of number of trending videos.
How these trending channels vary across the categories is something that we will take a look at in the multivariate plots section.
## [1] "Summary for views feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 559 95070 331600 1278000 1025000 149400000
## [1] "Summary for likes feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1600 7726 39720 25880 3094000
## [1] "Summary for dislikes feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 79 302 2598 1058 1674000
## [1] "Summary for comment count feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 238.0 888.5 4976.0 2914.0 1362000.0
## [1] "Summary for days on trending feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 5.000 4.958 7.000 14.000
## [1] "Summary for number of trending videos feature:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 24.00 37.58 62.00 114.00
The categories of Entertainment and Music have very high values for all the features across the board.
Nonprofits & Activism has the lowest median and 1st quartile values for nearly all the features.
Shows has a very small interquartile range (the difference between the 3rd and 1st quartiles) for all the features. It also has the highest median, 1st and 3rd quartile values for days on trending. Note, however, that this category contains only 2 videos.
Overall, it seems that Shows is one category that consistently appears on trending, although with only 2 videos this observation is tentative.
In the univariate plots and analysis sections, we observed that the trends in the entire dataset were more pronounced in the Top 100 Trending Videos.
We check if this is true for the correlations that we observed at the beginning of the bivariate plots section.
Compared with the entire dataset, the correlation between views and likes changes from 0.83 to 0.85, and that between dislikes and comment count drops from 0.83 to 0.36.
Again before plotting, we first normalize the values of the features to be in the range [0,1].
## views likes dislikes comment_count
## 1: 0.37922890 0.462533127 0.124579451 0.142836123
## 2: 0.04591265 0.273262573 0.051787371 0.203385258
## 3: 0.02022370 0.008511685 0.002655141 0.001589139
## 4: 0.01778891 0.056383824 0.003418948 0.047345549
## 5: 0.01294750 0.014068870 0.006837897 0.007126601
## 6: 0.01139241 0.014530244 0.000000000 0.007229601
We can observe the correlation that we found from the correlation matrix above.
The data points lie further from the y = x line, so overall the other features are less correlated with views.
Also, videos in the Top 100 Trending Videos trend for 12, 13 or 14 days only.
In this part, we examine how the various features in the Top 100 Trending Videos differ from those in the entire dataset.
The plots from Top 100 Trending Videos are on the right and those from the entire dataset are on the left.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 192400 792100 1665000 3411000 2939000 45940000
The most notable category in this comparison is Sports, which has a low median value in the entire dataset but the highest median, 1st and 3rd quartile values in the Top 100 Trending Videos.
Surprisingly, the Shows category has completely disappeared from the Top 100 Trending Videos set.
Music is the category with the second highest median value in the Top 100 Trending Videos, closely followed by Film & Animation and Comedy.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 512 19830 49440 105600 131500 822000
Sports, Music and People & Blogs pull well ahead of the other categories in terms of views, with some of the highest median and 3rd quartile values.
Also, across most of the categories, the median number of views is higher in the Top 100 Trending Videos than in the entire dataset.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 146.0 779.2 1600.0 5052.0 3766.0 165100.0
As we can see in the boxplot on the left for the entire dataset, most of the categories have median numbers of dislikes close to each other. But in the Top 100 Trending Videos, we see a sharp decrease for categories such as Education, Pets & Animals and News & Politics.
For popular categories like Entertainment and Music, on the other hand, the median number of dislikes is among the highest. The number of views and dislikes had a correlation of 0.71, so this is not very surprising.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1815 4667 11980 10680 203900
Again, Sports pulls ahead of other categories with Gaming having the second highest value in the Top 100 Trending Videos set. Also there is a sharp decrease in the category of Education.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 12.00 13.00 12.59 13.00 14.00
The effect of subsetting the data is more readily visible for this feature than for any other, with nearly all of the values across categories being 12, 13 or 14.
Surprisingly, Pets & Animals is the category that trends for the longest time in the Top 100 Trending Videos, whereas the Sports category, which was leading in all the features up until now, has one of the lowest median values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 14.00 25.00 31.83 43.25 109.00
There is a significant decrease in the number of trending videos of the Sports category between the entire dataset and the Top 100 Trending Videos.
The Comedy category, meanwhile, maintains its domination even in the Top 100 Trending Videos set, with the highest median and 3rd quartile values, closely followed by News & Politics and Entertainment.
We have a high correlation between the number of views and number of likes both in the entire dataset (0.83) and in the Top 100 Trending Videos (0.85).
We also have a high correlation between the number of dislikes and comment count in the entire dataset (0.83) but it drastically reduces in the Top 100 Trending Videos (0.36).
How the features vary across categories depends strongly on the feature under investigation.
Nonprofits & Activism has the lowest median and 1st quartile values for nearly all the features.
Shows has a very small interquartile range for all the features. It also has the highest median, 1st and 3rd quartile values for days on trending.
In the Top 100 Trending Videos, the trend of Entertainment and Music having high values across all the features continues, but it is a lot less pronounced, with other categories like Film & Animation and People & Blogs coming close.
Surprisingly, Pets & Animals is the category with the highest median, 1st and 3rd quartile values for days on trending in the Top 100 Trending Videos.
We have already seen the top trending channels. Here, we take a look at those channels and see how they are distributed across the different categories.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 82.00 88.00 98.00 98.48 108.00 114.00
Each of these channels posts videos of a single category only, as no channel has bars of different colours.
Furthermore, as we saw in the bivariate plots section, the most common categories are Entertainment and Sports.
ESPN has the highest number of trending videos (114). The mean and median number of trending videos are both about 98.
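The single-category observation is read off the bar colours, but it can also be checked numerically by counting the distinct categories per channel. The following base R sketch shows the idea on toy data; the channel names, category assignments and the names `yt_toy`, `n_categories` and `mixed_channels` are all hypothetical.

```r
# Toy data with placeholder channels; channel_C deliberately mixes categories
yt_toy <- data.frame(
  channel_title = c("channel_A", "channel_A", "channel_B",
                    "channel_C", "channel_C"),
  category_name = c("Sports", "Sports", "Entertainment",
                    "News & Politics", "Comedy"),
  stringsAsFactors = FALSE
)

# Number of distinct categories each channel posts in
n_categories <- vapply(
  split(yt_toy$category_name, yt_toy$channel_title),
  function(x) length(unique(x)),
  integer(1)
)

# Channels that post in more than one category (empty if there are none)
mixed_channels <- names(n_categories)[n_categories > 1]
mixed_channels
```

On the real top 25 channels, an empty `mixed_channels` would confirm what the plot suggests.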
In the bivariate plots section, we observed how the various features vary across different categories of videos. Now in this section, we can explore how these features vary across the Top 25 trending channels and see the classifications and conclusions of the bivariate plots section in action.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 297200 2390000 2941000 3874000 4101000 13030000
ESPN, NBA, Netflix and Vox are the top channels in terms of number of videos on trending. But NFL, WIRED and WWE are well ahead of these channels in terms of number of views.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3377 28130 50840 78150 102800 436500
Again, ESPN, NBA, Netflix and Vox are the top channels in terms of number of videos on trending. But First We Feast, NFL, The Tonight Show Starring Jimmy Fallon, WIRED and WWE are higher in terms of number of likes.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 217 1761 3218 12300 7591 132400
Here we see that even NFL comes close to the four channels that topped the previous plots. We see a massive spike in the number of dislikes for Washington Post (132,400), a News & Politics channel, which is more than an order of magnitude higher than the median (3,218). Again, ESPN and NFL are still in the top 3.
(The mean is not a good basis for comparison in this plot, as a very large outlier skews it.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 852 2771 7572 11560 18850 35910
Again, we see large spikes for CNN, WIRED and Washington Post, all of which are News & Politics channels.
(The maximum value, 35,910, is nearly five times the median value, 7,572.)
This continues the trend of people expressing their wide range of opinions not only through dislikes but also through comments.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 5.00 4.64 6.00 8.00
Although the domination of ESPN, NBA, Netflix and Vox continues in terms of number of videos on trending, only the NBA channel’s videos trend for longer than the average.
Also, channels like ABC News, Bon Appetit, Great Big Story and WIRED all produce videos that trend for longer than those of the channels mentioned above.
Looking at the top trending channels in terms of the number of videos trending, the same four channels ESPN, NBA, Netflix and Vox showed consistent domination across all the features except number of days on trending.
News & Politics channels in general had very large values for features like dislikes, comment count and days on trending, indicating strong opinions among viewers about the content posted by these channels.
Features like likes and views were generally higher for channels posting videos in the Entertainment and Music categories.
Features like dislikes, comment count and days on trending were generally much higher for News & Politics channels.
We visualize the correlations between views and likes (both normalized; since numerically the number of views is always higher in the dataset) on the left and that between dislikes and comment count (also normalized) on the right.
One important trend we can observe from these correlations is that a video with a high number of views also tends to have a high number of likes, which suggests (but does not prove) that the chance of a video being popular due to bad publicity is low.
From the second plot, we can see that a video with a high number of dislikes also tends to have a high comment count, which suggests that the video may be controversial and that people are therefore more likely to leave a comment.
Of course, these claims are based only on correlations. Correlation does not establish causation, and further study beyond this exploratory analysis would be needed to support any causal claim.
## [1] "IQR of Shows for views:"
## yt_trending$category_name == "Shows": FALSE
## [1] 930800.2
## --------------------------------------------------------
## yt_trending$category_name == "Shows": TRUE
## [1] 53558
## [1] "IQR of Shows for likes:"
## yt_trending$category_name == "Shows": FALSE
## [1] 24284.75
## --------------------------------------------------------
## yt_trending$category_name == "Shows": TRUE
## [1] 1816.5
## [1] "IQR of Shows for dislikes:"
## yt_trending$category_name == "Shows": FALSE
## [1] 979
## --------------------------------------------------------
## yt_trending$category_name == "Shows": TRUE
## [1] 67
## [1] "IQR of Shows for comment count:"
## yt_trending$category_name == "Shows": FALSE
## [1] 2676.75
## --------------------------------------------------------
## yt_trending$category_name == "Shows": TRUE
## [1] 867
## [1] "IQR of Shows for days on trending:"
## yt_trending$category_name == "Shows": FALSE
## [1] 4
## --------------------------------------------------------
## yt_trending$category_name == "Shows": TRUE
## [1] 3
Focusing specifically on the Shows category, we can see that across all of the features it has the lowest IQR values of any category, while also trending for the longest duration, as can be seen from the plots.
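The output above compares Shows against all other videos combined. To compare the IQR across every category at once, a grouped summary can be used; the following is a minimal base R sketch on toy numbers (the data frame `yt_toy` and its values are hypothetical, not taken from the dataset).

```r
# Toy data: two categories with known quartiles (hypothetical values)
yt_toy <- data.frame(
  category_name = c("Shows", "Shows",
                    "Music", "Music", "Music", "Music", "Music"),
  views         = c(100, 200,
                    10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# IQR = 3rd quartile minus 1st quartile, computed separately per category
iqr_by_category <- tapply(yt_toy$views, yt_toy$category_name, IQR)
iqr_by_category

# Group sizes matter when reading these numbers: an IQR from 2 points is
# only an interpolation between them
table(yt_toy$category_name)
```

Printing the group sizes alongside the IQR is useful here because the Shows category contains only 2 videos in the cleaned dataset.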
Between the two plots, we can see that videos posted by Washington Post have both the highest number of dislikes and one of the highest numbers of comments.
Looking at other channels in the above plots with notable values, ESPN, NFL, CNN and WIRED have some of the highest values for number of comments.
An interesting point to note is that all of these channels post videos exclusively in either the News & Politics or the Sports category. These categories have supporters with differing viewpoints and opinions, and hence a high possibility of disagreement, which is clearly visible in the distribution of the data in the above plots.
The YouTube Trending Statistics is a daily record of the top trending videos on the video sharing platform YouTube. The data used is that from the USA. It contains data about 23,362 (4,712 unique) videos across 13 features.
I began the analysis by preprocessing the data to remove some of the features and converting the dataset into a data.table for quick access, merging and similar operations. Some features, like days_on_trending and number_of_videos_trending, were derived from the data present in the dataset and added to it.
The analysis was carried out both over the entire dataset and over the Top 100 Trending Videos (a subset of the entire dataset). In both cases, views and likes had a high correlation, but dislikes and comment count had a high correlation only over the entire dataset and not in the Top 100 Trending Videos.
Entertainment and Music categories dominate the Top 100 Trending Videos in terms of the values of various features like views, likes, comment count and days on trending. Also, videos of the category Shows trend for the longest time.
Further, looking at the top trending channels, positively indicative features like views and likes were high for the channels posting Entertainment and Music videos, while features indicating controversy were high for channels posting News & Politics videos.
This trend is easily observed online. Politics tends to be a category that generates a lot of controversy as people with different opinions come into contact, whereas music is in general a category about which a majority of viewers have similar opinions.
This facet of online interactions is quite accurately captured by this analysis.
I also tried to include a feature measuring the number of days a video took from its upload date to start trending, but ultimately could not, as I was unable to preprocess the date format given in the dataset.
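For anyone extending this analysis, one possible way to parse the two date columns in base R is sketched below. It assumes that trending_date is in YY.DD.MM order (values such as 17.14.11 and 17.22.11 cannot have a month in the middle position) and that only the date part of publish_time matters; the variable names are illustrative, and the two sample values are the first rows shown at the top of this report.

```r
# Sample values from the first two rows of the dataset
trending_date <- c("17.14.11", "17.14.11")
publish_time  <- c("2017-11-13T17:13:01.000Z", "2017-11-13T07:30:00.000Z")

# Assumed layout of trending_date: two-digit year, day, month
trending <- as.Date(trending_date, format = "%y.%d.%m")

# publish_time is an ISO 8601 timestamp; the first 10 characters are the date
published <- as.Date(substr(publish_time, 1, 10), format = "%Y-%m-%d")

# Whole days between upload and appearance on the trending list
days_to_trend <- as.numeric(difftime(trending, published, units = "days"))
days_to_trend
```

This ignores the time of day and the time zone of the upload, so the result can be off by up to one day for videos published close to midnight.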
In addition to the analysis performed here, we could do a sentiment analysis using the title of the videos, video description and tags of the video. This can help in more accurately predicting the different features of the videos. This type of analysis can help understand how the perception of the audience varies over different types of content and consequently help in the development of future content.